Diverse types of expertise in facial recognition

Facial recognition errors can jeopardize national security, criminal justice, public safety and civil rights. Here, we compare the most accurate humans and facial recognition technology in a detailed lab-based evaluation and international proficiency test for forensic scientists involving 27 forensic departments from 14 countries. We find striking cognitive and perceptual diversity between naturally skilled super-recognizers, trained forensic examiners and deep neural networks, despite them achieving equivalent accuracy. Clear differences emerged in super-recognizers’ and forensic examiners’ perceptual processing, errors, and response patterns: super-recognizers were fast, biased to respond ‘same person’ and misidentified people with extreme confidence, whereas forensic examiners were slow, unbiased and strategically avoided misidentification errors. Further, these human experts and deep neural networks disagreed on the similarity of faces, pointing to differences in their representations of faces. Our findings therefore reveal multiple types of facial recognition expertise, with each type lending itself to particular facial recognition roles in operational settings. Finally, we show that harnessing the diversity between individual experts provides a robust method of maximizing facial recognition accuracy. This can be achieved either via collaboration between experts in forensic laboratories, or most promisingly, by statistical fusion of match scores provided by different types of expert.

For the face memory tests, 5 of the 7 super-recognizers scored more than 2 standard deviations (SD) above the mean on the CFMT+ (i.e., > 93%) [4], and the remaining 2 scored 1.7 SD above the mean. Only 2 of the 7 scored more than 1.7 SD above the mean on the CFMT-Aus [5], while 6 of the 7 did so on the UNSW Face Test [6].

Table S1. Participants' accuracy on each of the tests in the lab-based assessment. Values in boldface indicate performance deviating more than 1.7 SDs from the mean of the normative data.

Extended Group-Level Analysis of Super-Recognizers vs. Norms on Lab-Based Tests
Super-recognizers' individual and group scores on each test are shown against the normative scores in the main manuscript Figure 7. This finding shows that super-recognizers can achieve high levels of accuracy after only a short exposure, whereas forensic examiners' expertise takes longer and appears contingent on a slower method of comparison. This finding points to differences in the perceptual processes underlying the expertise of super-recognizers and forensic examiners.

+5 (likelihood ratio of 1,000,000 and above): The observations provide extremely strong support to the proposition that it is the same person relative to the proposition that it are different persons.
+4 (likelihood ratio of 10,000-1,000,000): The observations provide very strong support to the proposition that it is the same person relative to the proposition that it are different persons.
+3 (likelihood ratio of 100-10,000): The observations provide strong support to the proposition that it is the same person relative to the proposition that it are different persons.
+2 (likelihood ratio of 10-100): The observations provide support to the proposition that it is the same person relative to the proposition that it are different persons.
+1 (likelihood ratio of 2-10): The observations provide weak support to the proposition that it is the same person relative to the proposition that it are different persons.
0: The observations support neither the proposition that it is the same person nor the proposition that it are different persons.
-1: The observations provide weak support to the proposition that it are not the same persons relative to the proposition that it is the same person.
-2: The observations provide support to the proposition that it are not the same persons relative to the proposition that it is the same person.
-3: The observations provide strong support to the proposition that it are not the same persons relative to the proposition that it is the same person.
-4: The observations provide very strong support to the proposition that it are not the same persons relative to the proposition that it is the same person.
-5: The observations provide extremely strong support to the proposition that it are not the same persons relative to the proposition that it is the same person.
* Likelihood ratios corresponding to the inverse (1/X) of these values (X) will express the degree of support for the specified alternative compared to the first proposition.
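The mapping between likelihood ratios and the response scale can be sketched in a few lines of Python. This is an illustrative reading of the table and its footnote, not the procedure used in the proficiency test. The function name `verbal_scale` is hypothetical. Two details are assumptions: a value on a boundary (e.g. exactly 100) is assigned to the stronger band, and likelihood ratios between 1/2 and 2 are mapped to 0.

```python
def verbal_scale(lr: float) -> int:
    """Map a likelihood ratio (same person vs. different persons) to -5..+5."""
    if lr <= 0:
        raise ValueError("likelihood ratio must be positive")
    # Ratios below 1 favour the alternative proposition; per the footnote,
    # their inverse (1/X) expresses the degree of support for it.
    magnitude = lr if lr >= 1 else 1 / lr
    sign = 1 if lr >= 1 else -1
    # Lower bound of each band, strongest first (values taken from the table).
    bands = [(1_000_000, 5), (10_000, 4), (100, 3), (10, 2), (2, 1)]
    for lower_bound, level in bands:
        if magnitude >= lower_bound:  # assumption: boundaries go to the stronger band
            return sign * level
    return 0  # assumption: ratios between 1/2 and 2 support neither proposition


print(verbal_scale(500))   # strong support for "same person"
print(verbal_scale(0.25))  # weak support for "different persons"
```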

Extended AUC analyses
Performance on the proficiency test was compared using Area Under the ROC Curve (AUC) in a one-way ANOVA with Group (novices, super-recognizers, forensic examiners, DNNs, laboratories) as the between-subjects factor. There was a significant effect of Group (F(4,183) = 26.5, p < .001, ηp² = .37), which we followed up with planned comparisons. All groups had significantly higher AUC than novices (novices vs.

Completion time analyses
Test completion times are shown in Figure S2. No completion time data were recorded for one forensic examiner and one forensic laboratory, so they were excluded from this analysis. Completion times for participants who completed the test online were recorded by the testing software (18 super-recognizers, 65 novices). The remaining participants, who completed the test using the raw images and static response document, estimated their own completion times (19 super-recognizers, 15 forensic examiners, 41 novices, 18 forensic laboratories).
To assess the equivalence of measured and estimated completion times, we conducted independent-samples t-tests between the measured and estimated completion times of super-recognizers and novices.

Correct, Incorrect & Inconclusive decisions of forensic examiners and super-recognizers
To explore the decisional strategies of forensic examiners and super-recognizers in greater detail, we examined their proportions of correct, incorrect and inconclusive responses, categorising responses of 1 to 5 as "same person" decisions and responses of -5 to -1 as "different person" decisions.
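As a minimal sketch of this categorisation, the Python below scores a set of responses against ground truth. It assumes the coding used in the criterion analysis (positive values are "same person" decisions, negative values are "different person" decisions) and treats a response of 0 as "inconclusive". The function names and the example data are hypothetical and are not taken from the study.

```python
from collections import Counter


def classify(response: int, same_person: bool) -> str:
    """Label one response on the -5..+5 scale given the ground truth."""
    if response == 0:
        return "inconclusive"  # assumption: 0 is the inconclusive response
    said_same = response > 0   # assumption: positive values mean "same person"
    return "correct" if said_same == same_person else "incorrect"


def decision_proportions(responses, ground_truth):
    """Proportion of correct, incorrect and inconclusive decisions."""
    counts = Counter(classify(r, t) for r, t in zip(responses, ground_truth))
    n = len(responses)
    return {k: counts[k] / n for k in ("correct", "incorrect", "inconclusive")}


# Made-up responses to five image pairs (True = the pair shows the same person).
print(decision_proportions([3, -2, 0, 5, -4], [True, True, False, True, False]))
```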
We found that super-recognizers and forensic examiners used the response scale differently (see Figure 3A). While super-recognizers and examiners made similar proportions of correct (83.6% vs. 79.4%) and incorrect (16.1% vs. 11.6%) decisions, forensic examiners responded "inconclusive" (9.1%) far more often than super-recognizers, who almost never responded "inconclusive" (0. This stark difference in "inconclusive" responses suggests that forensic examiners deliberately avoid making identification decisions in some instances. In high-stakes real-world forensic practice, forensic examiners routinely declare some comparisons "inconclusive" to avoid making errors that could have profound life-changing consequences for the people involved, especially when the comparison involves poor-quality imagery or large age differences. Indeed, Norell et al. [9] observed that forensic examiners were more likely to respond "inconclusive" as image quality decreased. However, it was unclear from that work whether forensic examiners' increased tendency to respond "inconclusive" reflected an underlying sensitivity to which cases are more likely to result in errors (i.e. a strategic conservatism), or whether it simply reflected a generalised conservatism applied uniformly across comparisons that reduces the number of errors and correct decisions to a similar extent.
To investigate this question, we compared the proportion of forensic examiners who declared each of the 20 comparisons "inconclusive" with super-recognizers' accuracy on those comparisons. We found a strong negative correlation (Spearman's ρ = -.68, p = .001): the more often forensic examiners declared a comparison "inconclusive", the worse super-recognizers performed on it. And, considering only comparisons where participants made a "same person" or "different people" decision i.e. did not respond "inconclusive"; see 10

Sensitivity & Criterion
We calculated sensitivity using d′ (see Figure S4). We conducted a one-way ANOVA on the sensitivity data with Group (super-recognizers, forensic examiners, novices) as the between-subjects factor. There was a

To calculate criterion, we classified responses of 1 to 5 as "match" responses and responses of -5 to -1 as "non-match" responses (see Figure S4). Responses of 0 were excluded from the analysis of criterion. We conducted a one-way ANOVA on the criterion data with Group (super-recognizers, forensic examiners, novices) as the between-subjects factor. There was a significant effect of Group [F(4, 177) = 3.72, p = .006, ηp² = .08], which we followed up with planned comparisons. Super-recognizers showed a marginally significantly stronger response bias than forensic examiners. We also compared each group's criterion values to 0, which indicates a neutral response bias, using one-sample t-tests. Super-recognizers (t(36) = 3.98, p < .001, Cohen's d = 0.65) and novices (t(105) = 4.29, p < .001, Cohen's d = 0.42) had a significant response bias to say "same person". Forensic examiners did not show a significant response bias (t(15) = 0.22, p = .828, Cohen's d = 0.06).
We expected that police super-recognizers might show a less pronounced response bias to say "same person", given their awareness of the serious real-world consequences of misidentifications. However, the same-person bias was, if anything, stronger for police super-recognizers (M = -2.11) than for civilian super-recognizers (M = -0.83), although this difference was only marginally significant [t(35) = 1.93, p = .062, Cohen's d = 0.82].
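For readers unfamiliar with these measures, the sketch below computes d′ and criterion from responses on the -5 to +5 scale using the standard signal detection formulas, d′ = z(H) - z(F) and c = -(z(H) + z(F))/2, where H is the hit rate and F the false alarm rate. The text does not state which formulas or corrections were applied, so several details here are assumptions: the log-linear correction for rates of 0 or 1, the exclusion of 0 responses from both measures, and the function name `dprime_and_criterion`. Under this convention a negative criterion indicates a bias to respond "same person".

```python
from statistics import NormalDist


def dprime_and_criterion(responses, ground_truth):
    """Return (d', c), treating responses > 0 as "match" and < 0 as "non-match"."""
    z = NormalDist().inv_cdf
    # Responses of 0 are dropped (stated for criterion; assumed here for d' too).
    same = [r for r, t in zip(responses, ground_truth) if t and r != 0]
    diff = [r for r, t in zip(responses, ground_truth) if not t and r != 0]
    hits = sum(r > 0 for r in same)          # "match" on same-person pairs
    false_alarms = sum(r > 0 for r in diff)  # "match" on different-person pairs
    # Assumption: log-linear correction keeps both rates away from 0 and 1.
    hit_rate = (hits + 0.5) / (len(same) + 1)
    fa_rate = (false_alarms + 0.5) / (len(diff) + 1)
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion


# Made-up example: every same-person pair is called a match, and so are half
# of the different-person pairs, which yields a negative criterion.
responses = [3, 4, 2, 5, -3, 2, -1, 4]
truth = [True, True, True, True, False, False, False, False]
d_prime, criterion = dprime_and_criterion(responses, truth)
print(round(d_prime, 2), round(criterion, 2))
```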

Agreement of facial similarity judgements
DNNs showed a high level of agreement with other DNNs for both "same person" pairs (average ρ = 0.53) and "different person" pairs (average ρ = 0.65), as indicated by the cluster of red pixels in the top right-hand corners of Figures 5A and 5B. Humans also tended to agree with each other for "same person" pairs (average ρ = 0.18) and "different person" pairs (average ρ = 0.29), although this agreement was markedly weaker than the agreement among DNNs.
For "same person" pairs (see Figure 5A), forensic examiners (average ρ = 0.42) and forensic laboratories (average ρ = 0.44) show higher levels of agreement within their groups than super-recognizers (average ρ = 0.18) and novices (average ρ = 0.15) do. For "different person" pairs (see Figure 5B), forensic examiners (average ρ = 0.50), forensic laboratories (average ρ = 0.42) and super-recognizers (average ρ = 0.44) show similar levels of agreement within their groups, with novices (average ρ = 0.23) again agreeing least. The high level of agreement among forensic examiners and forensic laboratories for "same person" pairs may be a consequence of forensic practitioners' training to 'harmonise' their responses, i.e. for different practitioners examining the same image pair to arrive at the same point on the response scale. Greater agreement in responses across members of professional groups is often taken to indicate greater objectivity of forensic face identification methods, and so is perceived as desirable by the forensic science community [11][12][13][14][15][16].
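The within-group agreement figures above are averages of Spearman correlations. The sketch below shows one way such an average could be computed: rank each rater's responses (averaging tied ranks), correlate the ranks for every pair of raters, and take the mean. Whether the reported averages were computed over all rater pairs in exactly this way is an assumption, and the function names are hypothetical. The code uses only the standard library; a rater who gives the same response to every pair has zero variance and would raise a division error.

```python
from itertools import combinations


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise_agreement(raters):
    """Average Spearman's rho over all pairs of raters in a group."""
    pairs = list(combinations(raters, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Made-up responses from three raters to the same four image pairs.
print(mean_pairwise_agreement([[5, 3, 1, 4], [4, 3, 2, 5], [5, 2, 1, 3]]))
```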

Fusion analyses
We examined the benefits of fusing decisions from small groups of face identification experts. To do this, we randomly sampled sets of responses made by groups of 2 and 3 individuals 1000 times, computed the average response of each set to each image pair, and then calculated the accuracy of the collective decisions made by the set using AUC (see main text Figure 6). To analyse whether the fused responses improved accuracy, we performed planned one-tailed Wilcoxon rank sum tests predicting that each fusion group size would be more accurate than the smaller fusion group or the individual responses from the same group.
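The sampling and averaging procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: AUC is computed as the probability that a same-person pair receives a higher fused score than a different-person pair (ties count as half), groups are sampled without replacement within each draw, and the names `auc` and `fused_auc_samples` and the example data are hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, same in zip(scores, labels) if same]
    neg = [s for s, same in zip(scores, labels) if not same]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fused_auc_samples(raters, labels, group_size, n_samples=1000, seed=0):
    """AUC of averaged responses for randomly sampled groups of raters."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        group = rng.sample(raters, group_size)  # distinct raters per draw
        # Average the group's responses to each image pair.
        fused = [sum(column) / group_size for column in zip(*group)]
        results.append(auc(fused, labels))
    return results


# Made-up example: two raters who each misjudge a different same-person pair.
labels = [True, True, False, False]
rater_a = [5, -3, -2, -4]
rater_b = [-3, 5, -4, -2]
print(auc(rater_a, labels), auc(rater_b, labels))
print(fused_auc_samples([rater_a, rater_b], labels, group_size=2, n_samples=3))
```

In this toy case each rater alone reaches an AUC of 0.75, while the averaged responses separate the pairs perfectly, because the two raters' errors fall on different image pairs.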
Replicating previous work [17][18][19], we find that all fusion pairs (i.e., 2 x novices, 2 x super-recognizers, 2 x forensic examiners) showed significant improvements in accuracy relative to individual decisions from the same group (novices: W = 35970, p < .001; super-recognizers: W = 11758, p < .001; forensic examiners: W = 5772, p = .028). The best fusion results, however, were achieved by fusing human decision makers with DNNs. Table S10 shows the median and deviation in AUC achieved with fusion of each DNN with examiners and super-recognizers. We also find that all fusion triplets (i.e., 3 x novices, 3 x super-recognizers, 3 x forensic examiners) showed significant improvements in accuracy compared to their fusion pair counterparts (novices: W = 416052, p < .001; super-recognizers: W = 403195, p < .001; forensic examiners: W = 424256, p < .001). Further analysis of DNN fusions shows the combinations of DNNs that produced the highest overall AUC (Table S10). Examination of the gains in performance from DNN fusion (Table S11) and the correlations in similarity scores for image pairs (Table S12) suggests that gains from fusion are predicted by weaker correlations in similarity ratings. This relationship is confirmed by a significant negative correlation between system gains in AUC and correlation in similarity scores, r(45) = -0.42, p = .004.

Table S14. Median AUCs for individuals, and fused pairs and triplets (data plotted in Figure 6